Chapter 1

Theoretical  Development  of 
Gain  Modification 


1.1  Introduction 

We  present  a  backward  propagation  network  which  simultaneously  modifies 
the  gain  parameters  and  the  synaptic  weights.  The  additional  complexity 
is  minimized  by  the  fact  that  the  error  signal  for  modification  of  the  gain 
of  a  neuron  is  proportional  to  the  ordinary  error  signal  for  the  incoming 
synaptic  weights  to  the  neuron.  For  a  given  neuron,  the  proportionality 
factor  is  the  reciprocal  of  its  gain.  Thus,  only  the  ordinary  error  signal 
must  be  propagated.  The  input  to  a  synapse,  which  multiplies  the  error 
signal  in  the  standard  modification  rule  for  synapses,  is  replaced  by  the  net 
input  to  the  cell  in  the  gain  modification  rule.  We  also  demonstrate  that  our 
algorithm  can  be  viewed  as  a  gradient  descent  in  rescaled  synaptic  vectors 
with  effective  time-dependent  step-constants  which  depend  on  both  the 
magnitude  of  the  gains  and  the  magnitude  of  the  ordinary  synaptic  vectors. 
In  this  paper,  we  show  that  the  effect  of  gain  modification  by  this  algorithm 
can  be  used  to  enhance  the  improvements  in  convergence  rate  obtained 
by  the  use  of  high  momentum  in  ordinary  synaptic  modification.  These 
improvements  occur  without  degrading  the  generalization  capabilities  of 
the  final  solutions  obtained  by  the  network. 

1.2 Review of Notation for Standard Backward Propagation

We begin with a short summary of the notation which we have used for
backward propagation in our previous work (Bachmann, 1988) [1]. The notation
differs somewhat from that used by Rumelhart, Hinton, and Williams
(1986) [6] and Werbos (1988) [7]. For simplicity, we consider a three-layer
network. A typical network is illustrated in figure 1.1.

[Figure 1.1: diagram of a three-layer network showing the input layer, the hidden layer, and the output layer.]

Figure 1.1: The $o_k$ label the output units, the $h_i$ label the "hidden
units", and the $f_j$ label the input units. Only some of the connections
are shown. Superscripts on synaptic weights denote layer index.


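The feedforward equations are defined by:

$$x_i^s = \sum_{j=1}^{N_0} w_{ij}^{(1)} f_j^s + \theta_i^{(1)} \tag{1.1}$$

$$h_i^s = \psi(x_i^s; \lambda_i^{(1)}) \tag{1.2}$$

$$y_k^s = \sum_{i=1}^{N_1} w_{ki}^{(2)} h_i^s + \theta_k^{(2)} \tag{1.3}$$

$$o_k^s = \psi(y_k^s; \lambda_k^{(2)}) \tag{1.4}$$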

$o_k^s$ is the output of the kth unit in the output layer, $h_i^s$ is the output of the
ith unit in the hidden layer, and $f_j^s$ is the value of the jth input; $s$ denotes
the pattern index. Additionally, recall that $y_k^s$ is the total input to the kth
output unit, and $x_i^s$ is the total input to the ith hidden unit. $\theta_i^{(1)}$ is the bias
for the ith cell in the hidden layer, and $\theta_k^{(2)}$ is the bias for the kth cell in
the output layer. $\psi(z; \lambda)$ is the sigmoid input-output function defined by:

$$\psi(z; \lambda) = \frac{1}{1 + e^{-\lambda z}} \tag{1.5}$$

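We have introduced explicitly the gains $\lambda_k^{(2)}$ of the kth cell in the last layer
and $\lambda_i^{(1)}$ for the ith cell in the hidden layer.$^{1,2}$ $\varepsilon_s$, the energy of pattern $s$,
is given by:

$$\varepsilon_s = \frac{1}{2} \sum_{k=1}^{N_2} (o_k^s - T_k^s)^2 \tag{1.6}$$

Here $T_k^s$ denotes the target output of the kth output unit for pattern $s$, and $N_2$ is the number of output units.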
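As an illustration of equations 1.1 through 1.6, the following minimal sketch evaluates the feedforward pass with explicit gains for a single pattern. The function names, the list-of-lists weight layout, and the numerical values are assumptions made for this example; they are not taken from the original simulations.

```python
import math

def psi(z, gain=1.0):
    """Sigmoid input-output function with explicit gain (equation 1.5)."""
    return 1.0 / (1.0 + math.exp(-gain * z))

def forward(f, w1, theta1, lam1, w2, theta2, lam2):
    """Feedforward pass of equations 1.1-1.4 for one pattern.

    w1[i][j] connects input j to hidden unit i; w2[k][i] connects
    hidden unit i to output unit k. Returns (x, h, y, o).
    """
    x = [sum(wij * fj for wij, fj in zip(row, f)) + th
         for row, th in zip(w1, theta1)]
    h = [psi(xi, li) for xi, li in zip(x, lam1)]
    y = [sum(wki * hi for wki, hi in zip(row, h)) + th
         for row, th in zip(w2, theta2)]
    o = [psi(yk, lk) for yk, lk in zip(y, lam2)]
    return x, h, y, o

def pattern_energy(o, target):
    """Energy of a single pattern (equation 1.6)."""
    return 0.5 * sum((ok - tk) ** 2 for ok, tk in zip(o, target))

# Illustrative values only: 2 inputs, 2 hidden units, 1 output, all gains 1.
x, h, y, o = forward(
    f=[0.5, -0.2],
    w1=[[0.1, -0.1], [0.05, 0.02]], theta1=[0.0, 0.0], lam1=[1.0, 1.0],
    w2=[[0.1, -0.05]], theta2=[0.0], lam2=[1.0],
)
energy = pattern_energy(o, [1.0])
```

With all gains equal to 1 the sketch reduces to the ordinary network without a gain parameter.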

A pattern gradient, which is used to modify the synaptic weights following
the presentation of each pattern to the network, is computed from the energy
$\varepsilon_s$. The backward propagation method is thus only an approximation
to gradient descent since the true Liapunov function is really:

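$$\varepsilon = \sum_{s=1}^{m} \varepsilon_s = \frac{1}{2} \sum_{s=1}^{m} \sum_{k=1}^{N_2} (o_k^s - T_k^s)^2 \tag{1.7}$$

To make a connection with Rumelhart's delta error signal notation
(Rumelhart, 1986) [6], we note that by defining partial derivatives with
respect to the "net", or net input to a cell, we have:$^3$

$$\delta_k^{(2)} \equiv -\frac{\partial \varepsilon_s}{\partial y_k^s} = -\frac{\partial \varepsilon_s}{\partial o_k^s} \frac{\partial o_k^s}{\partial y_k^s} = -\lambda_k^{(2)} (o_k^s - T_k^s)\, o_k^s (1 - o_k^s) \tag{1.8}$$

$$\delta_i^{(1)} \equiv -\frac{\partial \varepsilon_s}{\partial x_i^s} = -\sum_{k=1}^{N_2} \frac{\partial \varepsilon_s}{\partial o_k^s} \frac{\partial o_k^s}{\partial y_k^s} \frac{\partial y_k^s}{\partial h_i^s} \frac{\partial h_i^s}{\partial x_i^s} = -\lambda_i^{(1)} h_i^s (1 - h_i^s) \sum_{k=1}^{N_2} (o_k^s - T_k^s)\, \lambda_k^{(2)} o_k^s (1 - o_k^s)\, w_{ki}^{(2)} \tag{1.9}$$

This allows one to write:

$$\delta_i^{(1)} = \lambda_i^{(1)} h_i^s (1 - h_i^s) \sum_{k=1}^{N_2} \delta_k^{(2)} w_{ki}^{(2)} \tag{1.10}$$

$^1$ In Rumelhart's original model there is no gain parameter $\lambda$ ($\lambda = 1$). Although initially
all gains are set to 1, our proposed model allows the gains to vary individually.

$^2$ Hopfield (1984) [3] has used a gain parameter in the continuous version of his model to
study the effect of changing the character of the nonlinearity of the input-output function.

$^3$ We have added superscripts to the error signals for greater clarity.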


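The synaptic modifications are proportional to the negative gradient:

$$\Delta_s(w_{ki}^{(2)}) = -\eta \frac{\partial \varepsilon_s}{\partial w_{ki}^{(2)}} = -\eta \frac{\partial \varepsilon_s}{\partial y_k^s} \frac{\partial y_k^s}{\partial w_{ki}^{(2)}} \tag{1.11}$$

$$\Delta_s(w_{ij}^{(1)}) = -\eta \frac{\partial \varepsilon_s}{\partial w_{ij}^{(1)}} = -\eta \frac{\partial \varepsilon_s}{\partial x_i^s} \frac{\partial x_i^s}{\partial w_{ij}^{(1)}} \tag{1.12}$$

Therefore, with the above definitions for $\delta_k^{(2)}$ and $\delta_i^{(1)}$, equations 1.11 and
1.12 become:$^4$

$$\Delta_s(w_{ki}^{(2)}) = \eta\, \delta_k^{(2)} h_i^s \tag{1.13}$$

$$\Delta_s(w_{ij}^{(1)}) = \eta\, \delta_i^{(1)} f_j^s \tag{1.14}$$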
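The error signals of equations 1.8 and 1.10 and the synaptic modifications of equations 1.13 and 1.14 can be sketched as follows. This is a minimal illustration under assumed names and made-up parameter values; it compares one analytic modification against a central finite difference of the pattern energy, which is a check we add here and not part of the original procedure.

```python
import math

def psi(z, gain):
    return 1.0 / (1.0 + math.exp(-gain * z))

def forward(f, w1, th1, lam1, w2, th2, lam2):
    x = [sum(a * b for a, b in zip(row, f)) + t for row, t in zip(w1, th1)]
    h = [psi(xi, li) for xi, li in zip(x, lam1)]
    y = [sum(a * b for a, b in zip(row, h)) + t for row, t in zip(w2, th2)]
    o = [psi(yk, lk) for yk, lk in zip(y, lam2)]
    return x, h, y, o

def error_signals(h, o, target, lam1, w2, lam2):
    """delta2 follows equation 1.8; delta1 follows equation 1.10."""
    delta2 = [-lk * (ok - tk) * ok * (1.0 - ok)
              for lk, ok, tk in zip(lam2, o, target)]
    delta1 = [li * hi * (1.0 - hi)
              * sum(dk * w2[k][i] for k, dk in enumerate(delta2))
              for i, (li, hi) in enumerate(zip(lam1, h))]
    return delta1, delta2

def weight_updates(eta, f, h, delta1, delta2):
    """Synaptic modifications of equations 1.13 and 1.14."""
    dw2 = [[eta * dk * hi for hi in h] for dk in delta2]
    dw1 = [[eta * di * fj for fj in f] for di in delta1]
    return dw1, dw2

# Illustrative values only (gains deliberately different from 1).
f, target = [0.5, -0.2], [1.0]
w1, th1, lam1 = [[0.1, -0.1], [0.05, 0.02]], [0.0, 0.1], [1.0, 2.0]
w2, th2, lam2 = [[0.1, -0.05]], [0.05], [1.5]
eta = 0.4

def energy(w1_trial):
    o = forward(f, w1_trial, th1, lam1, w2, th2, lam2)[3]
    return 0.5 * sum((ok - tk) ** 2 for ok, tk in zip(o, target))

x, h, y, o = forward(f, w1, th1, lam1, w2, th2, lam2)
delta1, delta2 = error_signals(h, o, target, lam1, w2, lam2)
dw1, dw2 = weight_updates(eta, f, h, delta1, delta2)

# Central finite difference of the energy with respect to w1[0][1].
step = 1e-6
w1_plus = [row[:] for row in w1]
w1_minus = [row[:] for row in w1]
w1_plus[0][1] += step
w1_minus[0][1] -= step
numeric_dw = -eta * (energy(w1_plus) - energy(w1_minus)) / (2.0 * step)
```

Only the ordinary error signals are propagated backward; the gains enter through the factors $\lambda_k^{(2)}$ and $\lambda_i^{(1)}$ in the two error-signal expressions.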

$^4$ As in Rumelhart et al. (1986) [6], the biases are modified by the same procedure;
however, the input for biases is defined to be unity.

Note that equation 1.10 defines the backward propagation of the error signals.
Recall also that in standard backward propagation, a "momentum"
term is often used at each step of the modification procedure in the gradient
descent search for the minimum. Heuristically, the momentum term can
be viewed as a means of increasing the step-constant when the curvature
of the energy surface is low and several successive modifications have the
same sign (Jacobs, 1988) [5]. The momentum term consists of adding a
small amount proportional to the previous modification, so that the actual
modification at time step t for the nth layer of synaptic connections is:

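$$\Delta_{s(t)}(w^{(n)}) = -\eta \nabla_{w^{(n)}} \varepsilon_{s(t)} + \kappa\, \Delta_{s(t-1)}(w^{(n)}) \tag{1.15}$$

where $s(t)$ denotes the index of the pattern presented at time step $t$ and
$\kappa$ is a positive constant less than 1. One can also write this in a slightly
different form:

$$\Delta_{s(t)}(w^{(n)}) = -\eta \sum_{t'=0}^{t} \kappa^{\,t-t'}\, \nabla_{w^{(n)}} \varepsilon_{s(t')} \tag{1.16}$$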


This form is more suggestive in showing that the use of a momentum term
is equivalent to a discrete approximation to a temporal integral average
with an exponentially decaying kernel. The kernel has the effect of emphasizing
the influence of the patterns most recently presented, assigning
exponentially less weight to those patterns presented earlier in time.
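The equivalence between the recursive momentum rule and an exponentially weighted sum over past gradients can be checked numerically. The sketch below uses scalar gradients with made-up values and assumes that the modification preceding the first time step is zero; the function names are ours.

```python
def momentum_recursive(gradients, eta, kappa):
    """Recursive form: each step adds kappa times the previous modification."""
    steps, previous = [], 0.0
    for g in gradients:
        previous = -eta * g + kappa * previous
        steps.append(previous)
    return steps

def momentum_kernel(gradients, eta, kappa):
    """Same modifications written with an exponentially decaying kernel."""
    return [-eta * sum(kappa ** (t - tp) * gradients[tp] for tp in range(t + 1))
            for t in range(len(gradients))]

gradients = [0.5, 0.4, -0.1, 0.3, 0.2]  # hypothetical pattern gradients
recursive = momentum_recursive(gradients, eta=0.4, kappa=0.9)
kernel = momentum_kernel(gradients, eta=0.4, kappa=0.9)
```

The two lists agree up to floating-point rounding, and the weight given to a gradient presented $t - t'$ steps earlier is $\kappa^{t-t'}$.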

1.3 Simultaneous Modification of Gain Parameters
and Synapses by a Backward
Propagation Algorithm

In this section, we consider the possibility of modifying the gain parameters
and the synaptic weights simultaneously. To accomplish this, we have
formulated a backward propagation procedure which modifies the gain parameters
in the network in a manner similar to the method used for the
synaptic weights. The procedure can take advantage of quantities already
calculated in the ordinary backward propagation procedure for the synaptic
weights, thus minimizing the additional complexity. The error signal for
the gain of a particular neuron is proportional to the ordinary synaptic error
signal for the incoming synaptic weights connected to the neuron. The
proportionality factor is just the reciprocal of the neuronal gain. Additionally,
the input to a particular synapse, which multiplies the error signal in
the standard synaptic modification formula, is replaced by the net input to
the neuron for the modification of its gain.

We note in passing that other work has been done on gain modification
procedures, for example Kruschke (1988) [4]; however, Kruschke's procedure
does not use the gradient with respect to the gain. Rather, gains are
modified in the context of a competitive learning scheme, which is combined
with backward propagation modification of the synaptic weights. In
contrast, our scheme uses a backward propagation procedure for both gains
and weights, incorporating respectively the gradients with respect to the
gains and the weights.

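To derive our model, we begin by defining rescaled error signals $\gamma_k^{(2)}$ and
$\gamma_i^{(1)}$ in terms of the error signals for the synaptic weights in equations 1.8
and 1.9:

$$\gamma_k^{(2)} \equiv \frac{1}{\lambda_k^{(2)}} \delta_k^{(2)} = -\frac{1}{\lambda_k^{(2)}} \frac{\partial \varepsilon_s}{\partial y_k^s} = -\frac{1}{\lambda_k^{(2)}} \frac{\partial \varepsilon_s}{\partial o_k^s} \frac{\partial o_k^s}{\partial y_k^s} = -(o_k^s - T_k^s)\, o_k^s (1 - o_k^s) \tag{1.17}$$

$$\gamma_i^{(1)} \equiv \frac{1}{\lambda_i^{(1)}} \delta_i^{(1)} = -\frac{1}{\lambda_i^{(1)}} \frac{\partial \varepsilon_s}{\partial x_i^s} = -\frac{1}{\lambda_i^{(1)}} \sum_{k=1}^{N_2} \frac{\partial \varepsilon_s}{\partial o_k^s} \frac{\partial o_k^s}{\partial y_k^s} \frac{\partial y_k^s}{\partial h_i^s} \frac{\partial h_i^s}{\partial x_i^s} = -h_i^s (1 - h_i^s) \sum_{k=1}^{N_2} (o_k^s - T_k^s)\, \lambda_k^{(2)} o_k^s (1 - o_k^s)\, w_{ki}^{(2)} \tag{1.18}$$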


Given these definitions, we may write a backward propagation equation for
the rescaled error signals $\gamma_k^{(2)}$ and $\gamma_i^{(1)}$ by combining equations 1.17 and 1.18.
Alternatively, we could have derived this formula by simply replacing $\delta_k^{(2)}$ [...]

[Equations 1.19 through 1.24 and their accompanying text are not recoverable from the source.]

where we have used equations 1.21 and 1.22 to write the derivatives with
respect to the gains in equations 1.23 and 1.24 in terms of derivatives with
respect to the net inputs $y_k^s$ and $x_i^s$ to the kth cell in the output layer
and the ith cell in the hidden layer respectively. Using the definitions in
equations 1.17 and 1.18, we can recast equations 1.23 and 1.24 as:

$$\Delta_s \lambda_k^{(2)} = \alpha\, \gamma_k^{(2)} y_k^s \tag{1.25}$$

$$\Delta_s \lambda_i^{(1)} = \alpha\, \gamma_i^{(1)} x_i^s \tag{1.26}$$

For comparison with equations 1.13 and 1.14 describing synaptic modification,
we note that in equations 1.25 and 1.26 depicting gain modification,
the ordinary error signals defined in equations 1.8 and 1.9 have been replaced
by the rescaled error signals defined in equations 1.17 and 1.18, and
the signal input to a particular incoming synapse (either $h_i^s$ or $f_j^s$ depending
on the layer) has been replaced by the corresponding total integrated
potential to the cell, i.e. $y_k^s$ or $x_i^s$ depending on the layer. Since the rescaled
error signals, $\gamma$, are proportional to the ordinary synaptic error signals, $\delta$,
we need only propagate the ordinary error signals $\delta$ and obtain the rescaled
error signals locally by dividing by the cell gain. Equivalently, we can view
the error signal as a quantity which initiates changes in the input-output
characteristics of the neuron at the same time that it modifies the incoming
synapses to the neuron. From this perspective, the modification rule for
a synapse couples the ordinary error signal and the input to the incoming
synapse with coupling constant $\eta$, while the gain modification rule couples
the ordinary error signal and the net input to a cell with coupling constant
$\alpha/\lambda$. The only difference is the strength of the coupling, and, in fact, since
the gain parameter is a more sensitive parameter than the synaptic weights,
we choose the initial coupling for the gains, $\alpha$, to be an order of magnitude
smaller than that of the synapses in the simulations described here.$^5$
Alternatively, we will show in the next section that one can view simultaneous
gain and synaptic modification as a gradient descent in rescaled synapses
with effective step-constants which are dependent on the gain and magnitude
of the ordinary synaptic vectors and are direction-dependent in the
synaptic vector space. From this perspective, the effective step-constant
is now also time-dependent through its dependence on the gain and the
magnitude of the ordinary synaptic vectors.

$^5$ The gains are all set to one initially. Therefore, the order of magnitude of the initial
coupling is determined by $\alpha$. Even though the gains are chosen to be initially the same,
symmetry is still broken by choosing the initial synaptic weights randomly in a small
hypercube.
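The gain modification rule of equations 1.25 and 1.26 can be checked against a direct numerical gradient with respect to the gains. The sketch below relies on the chain-rule identity $\partial \psi / \partial \lambda = (z/\lambda)\, \partial \psi / \partial z$ for the sigmoid of equation 1.5; the function names and parameter values are assumptions chosen for illustration, and the finite-difference comparison is our own check.

```python
import math

def psi(z, gain):
    return 1.0 / (1.0 + math.exp(-gain * z))

def forward(f, w1, th1, lam1, w2, th2, lam2):
    x = [sum(a * b for a, b in zip(row, f)) + t for row, t in zip(w1, th1)]
    h = [psi(xi, li) for xi, li in zip(x, lam1)]
    y = [sum(a * b for a, b in zip(row, h)) + t for row, t in zip(w2, th2)]
    o = [psi(yk, lk) for yk, lk in zip(y, lam2)]
    return x, h, y, o

def gain_updates(alpha, x, h, y, o, target, lam1, w2, lam2):
    # Ordinary error signals (equations 1.8 and 1.10).
    delta2 = [-lk * (ok - tk) * ok * (1.0 - ok)
              for lk, ok, tk in zip(lam2, o, target)]
    delta1 = [li * hi * (1.0 - hi)
              * sum(dk * w2[k][i] for k, dk in enumerate(delta2))
              for i, (li, hi) in enumerate(zip(lam1, h))]
    # Rescaled error signals, obtained locally by dividing by the cell gain.
    gamma2 = [d / l for d, l in zip(delta2, lam2)]
    gamma1 = [d / l for d, l in zip(delta1, lam1)]
    # Gain modifications (equations 1.25 and 1.26): the net input replaces
    # the synaptic input that appears in the weight rule.
    dlam2 = [alpha * g * yk for g, yk in zip(gamma2, y)]
    dlam1 = [alpha * g * xi for g, xi in zip(gamma1, x)]
    return dlam1, dlam2

# Illustrative values only.
f, target = [0.5, -0.2], [1.0]
w1, th1 = [[0.1, -0.1], [0.05, 0.02]], [0.0, 0.1]
w2, th2 = [[0.1, -0.05]], [0.05]
lam1, lam2 = [1.0, 2.0], [1.5]
alpha = 0.038

def energy(l1, l2):
    o = forward(f, w1, th1, l1, w2, th2, l2)[3]
    return 0.5 * sum((ok - tk) ** 2 for ok, tk in zip(o, target))

x, h, y, o = forward(f, w1, th1, lam1, w2, th2, lam2)
dlam1, dlam2 = gain_updates(alpha, x, h, y, o, target, lam1, w2, lam2)

# Central finite differences of the energy with respect to two gains.
step = 1e-6
numeric_hidden = -alpha * (energy([1.0, 2.0 + step], lam2)
                           - energy([1.0, 2.0 - step], lam2)) / (2.0 * step)
numeric_output = -alpha * (energy(lam1, [1.5 + step])
                           - energy(lam1, [1.5 - step])) / (2.0 * step)
```

No additional backward pass is needed: `gain_updates` reuses the same error signals that drive the synaptic modifications.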


1.4 Gain Modification Viewed as an Effective
Time-Dependent Step-Constant


In developing our theoretical arguments, it will be easiest to consider the
effects of gain modification on ordinary backward propagation without momentum.
It will be seen that the addition of momentum will not alter the
development here. To simplify matters, we adopt a notation which includes
synaptic weight vectors and biases in the same vector. We also change the
notation to a slightly more general form than that used in sections 1.2 and
1.3. For the ith cell in the nth layer,$^6$ we define:

$$W_i^{(n)} = (w_i^{(n)}, \theta_i^{(n)}), \qquad F^{(n),s} = (f^{(n),s}, 1) \tag{1.27}$$

where the input to layer $n$, $f^{(n),s} = o^{(n-1),s}$, is just the output vector of
the previous layer.$^7$ The jth component of $W_i^{(n)}$, $w_{ij}^{(n)}$, then is just the
connection from the jth cell in the (n-1)th layer to the ith cell in the nth
layer. This allows us to write the output for pattern $s$ of the ith cell in the
nth layer as:

$$o_i^{(n),s} = \frac{1}{1 + e^{-\lambda_i^{(n)} W_i^{(n)} \cdot F^{(n),s}}} \tag{1.28}$$

$^6$ The input layer is defined to have the index n = 0.

$^7$ To compare the notation for neuron activation used earlier for a three layer system,
$f^s$, $h^s$, $o^s$, with the current notation, we see that the inputs to the network would be
$f^{(1),s} = o^{(0),s} = f^s$, hidden unit outputs would be $f^{(2),s} = o^{(1),s} = h^s$, and the last layer
output would be $f^{(3),s} = o^{(2),s} = o^s$.

In the case of nonmodifiable gains set equal to one, it is apparent that we
may view the norm of the vector $W_i^{(n)}$ as the gain. This follows by rewriting
equation 1.28 as:

$$o_i^{(n),s} = \frac{1}{1 + e^{-\lambda_i^{(n)} |W_i^{(n)}|\, \hat{W}_i^{(n)} \cdot F^{(n),s}}} \tag{1.29}$$

Here $\hat{W}_i^{(n)}$ is a unit vector in the direction of $W_i^{(n)}$. This leads us to ask if
modifying the gains, $\lambda_i^{(n)}$, is truly advantageous. To answer this question,
we will need to define rescaled synaptic vectors according to:


$$u_i^{(n)} \equiv \lambda_i^{(n)} W_i^{(n)} \tag{1.30}$$

Note that with this definition equation 1.28 becomes:

$$o_i^{(n),s} = \frac{1}{1 + e^{-u_i^{(n)} \cdot F^{(n),s}}} \tag{1.31}$$


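Since $u_i^{(n)}$ is just a rescaling of the synaptic vector $W_i^{(n)}$, we can ask how this
vector changes when we modify $W_i^{(n)}$ and $\lambda_i^{(n)}$ separately. In particular, we
will find it useful to consider the changes in $u_i^{(n)}$ along the direction of $\hat{u}_i^{(n)}$
and those in the hyperplane perpendicular to $\hat{u}_i^{(n)}$. We denote these two
components by $(\Delta_s u_i^{(n)})_\parallel$ and $(\Delta_s u_i^{(n)})_\perp$. We also define $\hat{v}_i^{(n)}$ to be a unit
vector in the direction of $(\Delta_s u_i^{(n)})_\perp$ in the hyperplane orthogonal to $\hat{u}_i^{(n)}$. If
the changes in $\lambda_i^{(n)}$ and $W_i^{(n)}$, and thus in $u_i^{(n)}$, are sufficiently small, then
we may write:

$$\Delta_s u_i^{(n)} = \Delta_s(\lambda_i^{(n)} W_i^{(n)}) \approx \lambda_i^{(n)} \Delta_s W_i^{(n)} + W_i^{(n)} \Delta_s \lambda_i^{(n)} \tag{1.32}$$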

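The change $\Delta_s W_i^{(n)}$ can be resolved into two orthogonal components:$^8$

$$\Delta_s W_i^{(n)} = -\eta \nabla_{W_i^{(n)}} \varepsilon_s = -\eta \left( \hat{u}_i^{(n)} (\hat{u}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon_s) + \hat{v}_i^{(n)} (\hat{v}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon_s) \right) \tag{1.33}$$

Note that $\Delta_s \lambda_i^{(n)}$ is just:

$$\Delta_s \lambda_i^{(n)} = -\alpha \frac{\partial \varepsilon_s}{\partial \lambda_i^{(n)}} \tag{1.34}$$

However, we can write $\partial \varepsilon_s / \partial \lambda_i^{(n)}$ in terms of the component of $\nabla_{W_i^{(n)}} \varepsilon_s$ along
$\hat{W}_i^{(n)}$. To see this, observe that we may write:

$$\frac{\partial \varepsilon_s}{\partial \lambda_i^{(n)}} = \frac{\partial \varepsilon_s}{\partial (u_i^{(n)} \cdot F^{(n),s})} \frac{\partial (u_i^{(n)} \cdot F^{(n),s})}{\partial \lambda_i^{(n)}} = \frac{\partial \varepsilon_s}{\partial (u_i^{(n)} \cdot F^{(n),s})}\, |W_i^{(n)}|\, \hat{W}_i^{(n)} \cdot F^{(n),s} \tag{1.35}$$

$^8$ Here we neglect momentum.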
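At the same time, note that:

$$\hat{W}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon_s = \frac{\partial \varepsilon_s}{\partial |W_i^{(n)}|} = \frac{\partial \varepsilon_s}{\partial (u_i^{(n)} \cdot F^{(n),s})} \frac{\partial (u_i^{(n)} \cdot F^{(n),s})}{\partial |W_i^{(n)}|} = \frac{\partial \varepsilon_s}{\partial (u_i^{(n)} \cdot F^{(n),s})}\, \lambda_i^{(n)}\, \hat{W}_i^{(n)} \cdot F^{(n),s} \tag{1.36}$$

From equations 1.35 and 1.36, we obtain:

$$\frac{\partial \varepsilon_s}{\partial \lambda_i^{(n)}} = \frac{|W_i^{(n)}|}{\lambda_i^{(n)}}\, \hat{W}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon_s \tag{1.37}$$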


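Having derived equation 1.37, we can now combine it with equations 1.33
and 1.34 to rewrite equation 1.32 as:

$$\Delta_s u_i^{(n)} \approx -\eta \lambda_i^{(n)} \left\{ \left( 1 + \frac{\alpha}{\eta} \left( \frac{|W_i^{(n)}|}{\lambda_i^{(n)}} \right)^2 \right) \hat{u}_i^{(n)} (\hat{u}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon_s) + \hat{v}_i^{(n)} (\hat{v}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon_s) \right\} \tag{1.38}$$

where we have used the fact that $\hat{u}_i^{(n)} = \hat{W}_i^{(n)}$. This allows us to write:

$$\begin{pmatrix} (\Delta_s u_i^{(n)})_\parallel \\ (\Delta_s u_i^{(n)})_\perp \end{pmatrix} \approx -\eta \lambda_i^{(n)} \begin{pmatrix} 1 + \frac{\alpha}{\eta} \left( \frac{|W_i^{(n)}|}{\lambda_i^{(n)}} \right)^2 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \hat{u}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon_s \\ \hat{v}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon_s \end{pmatrix} \tag{1.39}$$
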
We can recast this equation in terms of gradients with respect to $u_i^{(n)}$ by
converting the gradients with respect to $W_i^{(n)}$. In doing so, we note one
subtlety about the gradients which were used to derive equation 1.39. The
gradient, $\nabla_{W_i^{(n)}} \varepsilon_s$, which appears in ordinary backward propagation (equation
1.33), and therefore in equation 1.39, is a vector of partial derivatives
evaluated at constant $\lambda_i^{(n)}$. We could denote this explicitly by writing
$(\nabla_{W_i^{(n)}} \varepsilon_s)_{\lambda_i^{(n)}}$. Similarly, in the modification of the gains (equation 1.34),
$\partial \varepsilon_s / \partial \lambda_i^{(n)}$ is calculated as a partial derivative to be taken at constant $W_i^{(n)}$, which
we could write as $(\partial \varepsilon_s / \partial \lambda_i^{(n)})_{W_i^{(n)}}$. In fact, equation 1.37 is an equation relating
$(\partial \varepsilon_s / \partial \lambda_i^{(n)})_{W_i^{(n)}}$ to $(\nabla_{W_i^{(n)}} \varepsilon_s)_{\lambda_i^{(n)}}$. Because the gradient of the energy with respect
to $W_i^{(n)}$, $(\nabla_{W_i^{(n)}} \varepsilon_s)_{\lambda_i^{(n)}}$, is calculated for fixed $\lambda_i^{(n)}$, we may easily express it
in terms of the gradient with respect to $u_i^{(n)}$, $(\nabla_{u_i^{(n)}} \varepsilon_s)_{\lambda_i^{(n)}}$. We see that the
jth components of the two gradients are related by:


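$$\left( \frac{\partial \varepsilon_s}{\partial W_{ij}^{(n)}} \right)_{\lambda_i^{(n)}} = \left( \frac{\partial u_{ij}^{(n)}}{\partial W_{ij}^{(n)}} \right)_{\lambda_i^{(n)}} \left( \frac{\partial \varepsilon_s}{\partial u_{ij}^{(n)}} \right)_{\lambda_i^{(n)}} = \lambda_i^{(n)} \left( \frac{\partial \varepsilon_s}{\partial u_{ij}^{(n)}} \right)_{\lambda_i^{(n)}} \tag{1.40}$$

We write this compactly as:

$$(\nabla_{W_i^{(n)}} \varepsilon_s)_{\lambda_i^{(n)}} = \lambda_i^{(n)} (\nabla_{u_i^{(n)}} \varepsilon_s)_{\lambda_i^{(n)}} \tag{1.41}$$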


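This allows us to rewrite equation 1.39 as:

$$\begin{pmatrix} (\Delta_s u_i^{(n)})_\parallel \\ (\Delta_s u_i^{(n)})_\perp \end{pmatrix} \approx -\eta\, (\lambda_i^{(n)})^2 \begin{pmatrix} 1 + \frac{\alpha}{\eta} \left( \frac{|W_i^{(n)}|}{\lambda_i^{(n)}} \right)^2 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \hat{u}_i^{(n)} \cdot \nabla_{u_i^{(n)}} \varepsilon_s \\ \hat{v}_i^{(n)} \cdot \nabla_{u_i^{(n)}} \varepsilon_s \end{pmatrix} \tag{1.42}$$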

We see that a backward propagation algorithm which simultaneously
modifies the synaptic weights and the gains is equivalent to a gradient
descent in the rescaled synapses $u_i^{(n)}$ with time-dependent step-constants.
In the direction of the rescaled synaptic vector, the step-constant now has
a quadratic dependence on the magnitudes of the gain and the ordinary
synaptic vector, and in the hyperplane perpendicular to the direction of
the rescaled synaptic vector, the step-constant has a quadratic dependence
only on the magnitude of the gain. From equation 1.42, we obtain:


$$\eta_\parallel = \eta\, (\lambda_i^{(n)})^2 + \alpha\, |W_i^{(n)}|^2 \tag{1.43}$$

$$\eta_\perp = \eta\, (\lambda_i^{(n)})^2 \tag{1.44}$$


The fact that the step-constants are positive-definite ensures that we
always take steps in the direction opposite the gradient; therefore, we can
classify this approach as a gradient descent algorithm. Momentum can be
added to the above argument without loss of generality, since it simply adds
a small amount proportional to the most recent change; the only difference
is that now the step-constants are time-dependent.
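The effective step-constants of equations 1.43 and 1.44 can be checked for a single neuron. The sketch below computes the ordinary updates of $W$ and $\lambda$, forms the first-order change in $u = \lambda W$ from equation 1.32, and compares it with a gradient step in $u$ that uses $\eta_\parallel$ along $\hat{u}$ and $\eta_\perp$ in the perpendicular hyperplane. The single-neuron setting, the function names, and the numerical values are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def rescaled_step(W, lam, F, target, eta, alpha):
    """Return (first-order, predicted, exact) changes in u = lam * W."""
    u = [lam * w for w in W]
    o = 1.0 / (1.0 + math.exp(-dot(u, F)))
    common = (o - target) * o * (1.0 - o)       # d(energy) / d(u . F)
    grad_u = [common * f for f in F]            # gradient with respect to u
    # Ordinary updates: W at fixed gain, gain at fixed W.
    dW = [-eta * lam * g for g in grad_u]
    dlam = -alpha * common * dot(W, F)
    # First-order change in u (equation 1.32) and the exact change.
    du_first = [lam * dw + w * dlam for dw, w in zip(dW, W)]
    du_exact = [(lam + dlam) * (w + dw) - lam * w for w, dw in zip(W, dW)]
    # Prediction from the effective step-constants (equations 1.42-1.44).
    norm_W = math.sqrt(dot(W, W))
    u_hat = [w / norm_W for w in W]             # equals W_hat when lam > 0
    eta_par = eta * lam ** 2 + alpha * norm_W ** 2
    eta_perp = eta * lam ** 2
    g_par = dot(u_hat, grad_u)
    du_pred = [-(eta_par * g_par * uh + eta_perp * (g - g_par * uh))
               for uh, g in zip(u_hat, grad_u)]
    return du_first, du_pred, du_exact

# Illustrative values only; the last component of F is the bias input 1.
du_first, du_pred, du_exact = rescaled_step(
    W=[0.3, -0.4, 0.1], lam=1.5, F=[0.5, -0.2, 1.0],
    target=1.0, eta=0.3, alpha=0.038,
)
```

The first-order change and the prediction agree to rounding error, while the exact change differs from both by the second-order term $\Delta_s \lambda\, \Delta_s W$ that equation 1.32 neglects.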

We note that the derivation of our model, resulting in equation 1.42,
depended on the approximation in equation 1.32 that we take sufficiently
small steps in changing the ordinary synapses and gains. If the synapses
grow to be very large, it is possible that this approximation may break
down. However, depending on our choices of $\eta$ and $\alpha$, the approximation
will be valid for at least part of the evolution of the network and certainly
during the early development. In a continuum model, the derivation would
be exact, provided that we replace the single pattern energy $\varepsilon_s$ with the
total energy over all patterns $\varepsilon = \sum_s \varepsilon_s$. For such a continuum model, we
would replace equation 1.32 with:


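$$\frac{d u_i^{(n)}}{dt} = \frac{d (\lambda_i^{(n)} W_i^{(n)})}{dt} = \lambda_i^{(n)} \frac{d W_i^{(n)}}{dt} + W_i^{(n)} \frac{d \lambda_i^{(n)}}{dt} \tag{1.45}$$

Similarly, we would replace equation 1.33 with:

$$\frac{d W_i^{(n)}}{dt} = -\eta \nabla_{W_i^{(n)}} \varepsilon = -\eta \left( \hat{u}_i^{(n)} (\hat{u}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon) + \hat{v}_i^{(n)} (\hat{v}_i^{(n)} \cdot \nabla_{W_i^{(n)}} \varepsilon) \right) \tag{1.46}$$

and equation 1.34 with:

$$\frac{d \lambda_i^{(n)}}{dt} = -\alpha \frac{\partial \varepsilon}{\partial \lambda_i^{(n)}} \tag{1.47}$$

In equations 1.35, 1.36, and 1.37, we would need to sum both sides of the
equation over the pattern index $s$, to obtain the appropriate relationships
for the total energy $\varepsilon$. In the end, we would obtain an exact nonlinear
differential equation of the form:

$$\begin{pmatrix} \left( \frac{d u_i^{(n)}}{dt} \right)_\parallel \\ \left( \frac{d u_i^{(n)}}{dt} \right)_\perp \end{pmatrix} = -\eta\, (\lambda_i^{(n)})^2 \begin{pmatrix} 1 + \frac{\alpha}{\eta} \left( \frac{|W_i^{(n)}|}{\lambda_i^{(n)}} \right)^2 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \hat{u}_i^{(n)} \cdot \nabla_{u_i^{(n)}} \varepsilon \\ \hat{v}_i^{(n)} \cdot \nabla_{u_i^{(n)}} \varepsilon \end{pmatrix} \tag{1.48}$$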


from  which  we  would  obtain  the  effective  time-dependent  step-constants  in 
equations  1.43  and  1.44. 

1.5 Impact of the Effective Time-Dependent
Step-Constant


In the previous section, we derived the form of the effective step-constant
and determined that $\eta_\parallel$ grows quadratically as a function of the magnitudes
of the ordinary synaptic vector and the gain and that $\eta_\perp$, the step-constant
in the direction of change of $\hat{u}_i^{(n)}$, grows quadratically with the gain. This may
be particularly important in the early stages of synaptic development when
$u_i^{(n)} \cdot F^{(n),s}$ is small and the output can be approximated linearly as:$^9$

$$\frac{1}{1 + e^{-u_i^{(n)} \cdot F^{(n),s}}} \approx \frac{1}{2 - u_i^{(n)} \cdot F^{(n),s}} \approx \frac{1}{2} \left( 1 + \frac{1}{2}\, u_i^{(n)} \cdot F^{(n),s} \right) \tag{1.49}$$

In this regime, therefore, the error signals can be approximated as polynomial
functions of $u_i^{(n)} \cdot F^{(n),s}$, and the quadratic dependence of the
step-constants on $|W_i^{(n)}|$ and $\lambda_i^{(n)}$ is likely to be significant.$^{10}$
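The quality of the linear approximation in equation 1.49 can be gauged with a few sample arguments. In the sketch below, `a` stands for the scalar $u_i^{(n)} \cdot F^{(n),s}$; the sample values are arbitrary and serve only to illustrate how the error grows away from zero.

```python
import math

def sigmoid(a):
    """Unit output as a function of a = u . F (equation 1.31)."""
    return 1.0 / (1.0 + math.exp(-a))

def linearized(a):
    """Small-argument approximation from equation 1.49."""
    return 0.5 * (1.0 + 0.5 * a)

# Absolute error of the linear approximation at a few small arguments.
errors = {a: abs(sigmoid(a) - linearized(a)) for a in (0.01, 0.1, 0.5)}
```

The leading correction to the linear form is cubic in `a`, so the approximation degrades quickly once the rescaled synaptic vectors grow.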

In the next chapter, we will demonstrate that the use of high momentum
leads to shorter convergence time. When gain modification is combined
with high momentum, there is further improvement in convergence time.
Empirical results suggest that when high momentum is used, shorter convergence
times result because the synapses develop more rapidly. When
gain modification is added to high momentum, the rate of development of
the effective synapses is accelerated further.


$^9$ The initial synaptic weights are small, between -0.1 and 0.1, and the connectivity of
the networks we considered in the next chapter was small. Additionally, the input patterns
were confined to the unit disk. Therefore, we can expect that this approximation will be
valid at the beginning of synaptic development.

$^{10}$ Furthermore, note that the ordinary error signals, as shown earlier, are explicitly
dependent on the gains. This may also play a significant role in the time evolution of the
network.

Chapter 2

Experimental Benchmarks for
Gain Modification and
Momentum


2.1  Momentum  and  the  Concentric  Circle 
Paradigm 

We describe below some benchmarks for the concentric circle paradigm. In
this paradigm, we train a backward propagation network with two inputs
representing the x and y coordinates of a point in the unit disk. Within the
unit disk, there are two pattern classes separated by a circular boundary
at $r = 1/\sqrt{2}$. This divides the disk into two regions, an inner disk and an
outer annulus, both of area $\pi/2$. For the single output unit, the target output
for patterns in the outer annulus is 1 and for patterns in the inner disk is
0. In the simulations described below, the network had one hidden layer
of 6 hidden units, and the network was trained on 40 patterns randomly
selected from the unit disk. For each setting of the parameters, we repeated
the experiment 90 times, starting the weights at different random points in a
small hypercube; each initial weight was chosen randomly between -0.1 and
0.1. Additionally, a different randomly selected set of 40 patterns was used
in each experiment. In all experiments the step-constant in the first layer
of connections was two times the size of the step-constant in the second
layer. We have found that this accelerates the learning procedure, a fact
which is consistent with what Becker and LeCun (1988) [2] have observed.
In their work, they determined that the step-constant should be scaled by
a factor of $1/\sqrt{N_c}$, where $N_c$, for a given layer of synapses, is the number of
connections in that layer onto a neuron in the next layer of neurons. If this
scaling is a general principle, then, for our particular network architecture,
we should have increased the step-constant in the first layer of synapses
by a factor of $\sqrt{3}$. However, by choosing a factor of 2, we are at least
close to what Becker and LeCun found to be ideal for the
binary input paradigm which they investigated. As yet, we have not
studied in great detail the scaling of step-constants as a function of the
number of synaptic connections in a layer. However, for consistency, in the
simulations described in the next section, we chose $\alpha$, the step-constant
for gain modification, to be twice as large for neurons in the hidden layer
compared to its value in the output layer.
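The setup of the concentric circle paradigm can be sketched as follows. The text states only that patterns are randomly selected from the unit disk and that initial weights lie between -0.1 and 0.1; uniform sampling over the disk (by rejection), uniform sampling of the weights, the omission of biases, and all function names are assumptions of this sketch.

```python
import math
import random

BOUNDARY_RADIUS = 1.0 / math.sqrt(2.0)  # class boundary r = 1/sqrt(2)

def sample_unit_disk(rng):
    """Draw one point from the unit disk by rejection (uniformity is assumed)."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return x, y

def make_patterns(n, rng):
    """Return n ((x, y), target) pairs: target 1 in the annulus, 0 in the inner disk."""
    patterns = []
    for _ in range(n):
        x, y = sample_unit_disk(rng)
        target = 1.0 if math.hypot(x, y) > BOUNDARY_RADIUS else 0.0
        patterns.append(((x, y), target))
    return patterns

def initial_weights(n_out, n_in, rng, half_width=0.1):
    """Weights drawn from a small hypercube centred on the origin."""
    return [[rng.uniform(-half_width, half_width) for _ in range(n_in)]
            for _ in range(n_out)]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
training_set = make_patterns(40, rng)  # 40 training patterns per experiment
w1 = initial_weights(6, 2, rng)        # 2 inputs -> 6 hidden units
w2 = initial_weights(1, 6, rng)        # 6 hidden units -> 1 output

# Ratio of first-layer to second-layer step-constants implied by 1/sqrt(N_c)
# scaling, with 2 connections onto each hidden unit and 6 onto the output.
scaling_ratio = math.sqrt(6.0 / 2.0)
```

Counting only the ordinary synapses, the fan-in values 2 and 6 give the ratio $\sqrt{3} \approx 1.73$, which is close to the factor of 2 used in the simulations.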

In one set of experiments (series ab), we took the step-constant to be
$\eta = 0.4$. The momentum was varied by increments of 0.1 from 0.0 to 0.9,
and 90 experiments were performed at each value of momentum. The mean
and standard deviation of convergence time are plotted for each value of
momentum in figure a.1 of appendix a. Only those experiments converging
within 20000 epochs$^{1,2}$ were used to compute the means and standard
deviations. From the graph, it is apparent that the mean convergence
time decreases with increasing momentum. The large error bars, however,
are somewhat misleading and tend to underemphasize the rather dramatic
improvement in convergence time at high momentum. The size of the errors
is typically comparable to the mean and is due to the fact that there are
a number of trials for which the convergence time is several times larger
than for the majority of the trials. Given the limited number of trials, 90,
this tends to make the error bars large and to mask the improvement of
the central cluster of trials. With increasing momentum, the central cluster
of convergence times becomes narrower and moves toward shorter times.
We illustrate this point in a series of graphs for series ab, figures a.4-a.13,
ordered by momentum, in which we plot a histogram of number of trials
converging vs. convergence time.

$^1$ An epoch is defined to be a complete presentation of the data set sequentially. However,
we modify the network following the presentation of each pattern.

$^2$ In our experiments, we defined the convergence of a network to occur when the network
output was within 0.1 of the target output for all training patterns.

It is interesting to note that at low momentum, there is evidence of
at least two distributions in the convergence times. These two clusters
gradually move to shorter convergence times and merge to form a single
cluster at low convergence time.

Figure a.2 of appendix a shows the fraction of trials converging within
20000 epochs. Over most of the range of $\kappa$, that is for $\kappa$ between 0.0 and
0.8, the percentage of trials converging is in the range of ≈ 89-98%. At
very high momentum, however, $\kappa = 0.9$, the percentage of trials converging
drops to 79%. At the same time, the amount of generalization by the
network is not momentum dependent. This is readily apparent in figure
a.3 of appendix a, in which we plot the mean and standard deviation of
the fraction of 5000 randomly selected test patterns which were correctly
identified by the network after the network converged. We also graph the
fraction which could be correctly identified by the network's "best guess",
that is the output which the network was closest to, 0 or 1. Means and standard
deviations are depicted only for those trials which actually converged
within 20000 epochs. As indicated by the small error bars, the generality
of the solutions varies only weakly across trials for a given value of momentum.
Furthermore, as we have noted, the mean generality is constant
across momentum. We expect, therefore, that solution generality should
not have a strong dependence on convergence time.
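The two generalization measures, together with the convergence criterion, can be written down compactly. In this sketch the tolerance of 0.1 follows the convergence definition used for training, applying the same tolerance to test patterns is our reading of "correctly identified", and assigning an output of exactly 0.5 to class 1 is an arbitrary assumption; the sample outputs are made up.

```python
def has_converged(outputs, targets, tol=0.1):
    """True when every output is within tol of its target."""
    return all(abs(o - t) < tol for o, t in zip(outputs, targets))

def best_guess(output):
    """Target value (0 or 1) that the output is closest to; 0.5 maps to 1 by assumption."""
    return 1.0 if output >= 0.5 else 0.0

def generalization(outputs, targets, tol=0.1):
    """Return (fraction within tol of target, fraction correct by best guess)."""
    n = len(targets)
    strict = sum(abs(o - t) < tol for o, t in zip(outputs, targets)) / n
    guess = sum(best_guess(o) == t for o, t in zip(outputs, targets)) / n
    return strict, guess

# Hypothetical network outputs on five test patterns.
outputs = [0.95, 0.70, 0.05, 0.40, 0.60]
targets = [1.0, 1.0, 0.0, 0.0, 1.0]
strict_fraction, guess_fraction = generalization(outputs, targets)
```

The best-guess fraction can never be smaller than the strict fraction, since any output within 0.1 of its target is also on the correct side of 0.5.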

We can conclude also that the length of convergence time has very little
to do with the generality of the solution reached for this paradigm; we
infer this from the fact that the error bars are small in the generalization
graph in figure a.3. It is interesting to ask, then, which features might
distinguish the network solutions with long convergence time from those
with short convergence time. In appendix b, we have produced a scatter
plot of the mean and standard deviation (over the neurons in a particular
solution) of the magnitude of synaptic vectors vs. convergence time for
series ab. As defined in the previous chapter, each of these vectors includes
both the ordinary synapse and the bias. For each layer, there is a set of 3
graphs, ordered in increasing momentum, one at zero momentum ($\kappa = 0$),
one at intermediate momentum ($\kappa = 0.5$), and one at high momentum
($\kappa = 0.9$). The graphs for layer 2 are in figures b.1-b.3, and those for layer
1 are in figures b.4-b.6. We note that as the momentum is increased, the
longer convergence times move toward the cluster of convergence times at
shorter times; this region of the graph becomes more dense, and itself moves
simultaneously closer to the origin along the time axis. As a consequence,
as the momentum is increased and the number of trials converging within
the first several thousand iterations becomes more dense, we observe that,
as a population, the synaptic norms rise more rapidly as a function of
convergence time.


2.2  Gain  Modification  Combined  with  High 
Momentum  Synaptic  Modification 

We have seen from the histograms of number of trials converging vs. convergence
time for series ab in appendix a, that high momentum can dramatically
improve the convergence rate for the majority of the trials. With
this in mind, we combined the gain modification procedure described earlier
with high momentum synaptic modification. We chose $\alpha$, the gain
modification step-constant, to be an order of magnitude smaller than the
synaptic modification step-constant, $\eta$, since the gain is, in general, a more
sensitive parameter. In series ag, we examined three different cases: no
momentum (sag3), high momentum (sag2), and high momentum with gain
modification (sag4, sag5, sag6). In each run, we performed 500 experiments.
We summarize the parameters chosen for these runs:$^3$


Run Number    η      k      α

sag3          0.3    0.0    0.000
sag2          0.3    0.8    0.000
sag5          0.3    0.8    0.020
sag4          0.3    0.8    0.038
sag6          0.3    0.8    0.060
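To illustrate how the three parameters in the table enter a single update, the following is a minimal sketch for one neuron, not the code used for series ag. It assumes a logistic activation applied to the gain times the net input, a squared-error cost, the usual momentum form for the synaptic changes, and no momentum on the gain. The function names and the starting values are hypothetical. The gain change uses the ordinary error signal divided by the gain, multiplied by the net input, as described in chapter 1.

```python
import math

def forward(w, b, g, x):
    """Return the net input and the output of one neuron with gain g."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return net, 1.0 / (1.0 + math.exp(-g * net))

def step(w, b, g, x, target, prev_dw, prev_db, eta=0.3, k=0.8, alpha=0.038):
    """One update of synapses, bias, and gain for a single output neuron."""
    net, y = forward(w, b, g, x)
    # Ordinary error signal: derivative of 0.5 * (y - target)**2 w.r.t. net.
    delta = (y - target) * y * (1.0 - y) * g
    # Synaptic changes with momentum k (assumed form).
    dw = [-eta * delta * xi + k * p for xi, p in zip(x, prev_dw)]
    db = -eta * delta + k * prev_db
    # Gain change: error signal scaled by 1/g, net input in place of x.
    dg = -alpha * (delta / g) * net
    new_w = [wi + d for wi, d in zip(w, dw)]
    return new_w, b + db, g + dg, dw, db

# Hypothetical starting point, with the sag4 step-constants as defaults.
w, b, g = [0.5, -0.3], 0.4, 1.5
w, b, g, dw, db = step(w, b, g, x=[1.0, 2.0], target=1.0,
                       prev_dw=[0.0, 0.0], prev_db=0.0)
print(w, b, g)
```

For a hidden unit, the same step would be called with both step-constants doubled, since the first layer uses twice the tabulated values.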

In figures c.1 and c.2 of appendix c, we plot the percentage of trials converging
vs. convergence time on two different time scales for the runs listed
above. Detailed histograms of sag2, sag3, and sag4, comparing the three
cases on two different time scales, the first 6000 epochs (figures c.3-c.5)
and the full 20000 epochs (figures c.6-c.8), are also provided in appendix c.

3 As noted earlier, η → 2η and α → 2α for the first layer of synapses and the gains of
the hidden units, respectively. The values for η and α in the table are those used in
the second layer of synapses and the gains of the output units, respectively.

Examining the graphs in figures c.1 and c.2, we see that without momentum
or gain modification, 80% of the simulations converge within ≈ 10000
epochs. High momentum clearly leads to much more rapid convergence,
for we obtain the 80% level within ≈ 1500 epochs. When gain modification
is added to high momentum, we achieve the 80% level within ≈ 500
epochs. This is 3 times faster than high momentum without gain modification
and 20 times faster than the bare algorithm. This is a tantalizing
result, for on a more complex problem where longer convergence times are
more probable, reaching the 80% level in a third of the time (compared
with high momentum alone) could be quite significant. Asymptotically,
the runs in which gain modification was combined with high momentum
(sag4, sag5, sag6) also achieved the highest percentage of trials converging,
≈ 95-98%. For high momentum without gain modification, the asymptotic
percentage was slightly lower at ≈ 93%, and for the bare algorithm,
lower still at ≈ 90%.
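The 80% levels quoted above can be read off a list of per-trial convergence times. The following is a minimal sketch of that calculation, assuming each run is summarized by the convergence epochs of the trials that did converge plus the total number of trials; the function name and the sample data are hypothetical and are not taken from series ag.

```python
import math

def epochs_to_reach(convergence_epochs, n_trials, fraction=0.8):
    """Smallest epoch count by which `fraction` of all trials have converged.

    convergence_epochs lists only the trials that converged; trials that
    never converged count toward n_trials. Returns None if the level is
    never reached.
    """
    # Small guard so that e.g. 0.7 * 10 is not rounded up to 8 by float error.
    needed = math.ceil(fraction * n_trials - 1e-9)
    converged = sorted(convergence_epochs)
    if len(converged) < needed:
        return None
    return converged[needed - 1]

# Hypothetical run: 10 trials, 9 of which converged.
times = [120, 250, 310, 400, 450, 480, 500, 500, 3000]
level_80 = epochs_to_reach(times, n_trials=10)
print(level_80)
```

The speedups in the text are ratios of such levels: 10000/500 = 20 for the bare algorithm and 1500/500 = 3 for high momentum alone.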

A comparison of the synapses obtained in these three cases is illustrated
in the graphs in appendix d. Figures d.1, d.2, and d.3 graph the synaptic
vector norm 4 in layer 2 vs. convergence time. Figure d.1 is the no
momentum case (sag3), figure d.2 the high momentum case (sag2), and
figure d.3 the high momentum with gain modification case (sag4). As a
population, there is a trend toward a longer synaptic norm in layer 2 as
time increases when the bare algorithm is used (figure d.1). This increase
in synaptic norm occurs more rapidly when high momentum (figure d.2)
is added to the bare algorithm. When we combine high momentum with
gain modification (figure d.3), there is some depression of the size of the
synaptic norm. In this case, however, the length of the effective synaptic
vector, or rescaled synaptic vector, depicted in figure d.4 in appendix
d, develops much more rapidly than in the high momentum case without
gain modification. This is due to the fact that the gains in this layer as a
population are typically ≈ 1.4-2.5 (see figure d.5 of appendix d, where the
gains are plotted as a function of synaptic norm for layer 2). This seems to
speed the process of learning, since solutions with longer effective synaptic
norms are achieved without requiring the synapses themselves to become
large; a single multiplicative factor, the gain, hastens the change in the
length of the effective synaptic vector.
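The role of the gain as a single multiplicative factor can be checked numerically. The sketch below assumes that the rescaled synaptic vector is the ordinary vector (synapses plus bias) multiplied by the gain, and that the activation is a logistic function of the gain times the net input; the function names and the numbers are hypothetical.

```python
import math

def output(weights, bias, gain, x):
    """Logistic output of one neuron with the given gain."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-gain * net))

def rescaled_vector(weights, bias, gain):
    """Assumed definition: every component, including the bias, times the gain."""
    return [gain * w for w in weights], gain * bias

def norm(weights, bias):
    """Length of the synaptic vector, with the bias as one more component."""
    return math.sqrt(sum(w * w for w in weights) + bias * bias)

# Hypothetical neuron with a gain inside the range observed in layer 2.
weights, bias, gain = [3.0, 0.0], 4.0, 2.0
r_weights, r_bias = rescaled_vector(weights, bias, gain)

ordinary_length = norm(weights, bias)        # length of the ordinary vector
effective_length = norm(r_weights, r_bias)   # gain times the ordinary length

# The neuron with gain 2 behaves like a unit-gain neuron with the rescaled vector.
x = [0.2, -0.7]
same = abs(output(weights, bias, gain, x) - output(r_weights, r_bias, 1.0, x)) < 1e-12
print(ordinary_length, effective_length, same)
```

Under these assumptions, a gain of about 2 doubles the effective length without any change in the ordinary synapses.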

4 As before, this vector includes both the ordinary synaptic vector and the bias.

In layer 1, a similar trend obtains (figures d.6-d.9). 5 As in layer 2,
longer effective synapses in layer 1 are achieved more rapidly by modifying
the gain parameter. When we compare figures d.6 and d.7, we see that
the increase in E(|·|), the mean synaptic vector norm, occurs more rapidly
as a function of convergence time for the majority of the population as we
go from no momentum to high momentum. 6 However, the synaptic norms
(figure d.8) are not significantly depressed in the case of high momentum
with gain modification; this leads to rescaled synaptic vector norms which
are much larger than in the other two cases (see figure d.9). 7 In figure d.8,
the ordinary synapses are probably not significantly depressed in layer 1
because the step-constant was twice as large for layer 1, as we noted earlier.

The mean gain is plotted versus mean synaptic vector norm in figure
d.10. The gains are principally between 1.0 and 2.0 in this layer and act to
accelerate the development of the length of the effective synaptic vectors.

Rapid  development  of  the  synaptic  vector  norms  appears  to  be  the  factor 
which  decreases  the  convergence  time  when  high  momentum  is  used,  and 
this  effect  is  further  enhanced  when  gain  modification  is  used  with  high 
momentum. We note also, for completeness, that the use of gain modification
did not in any way degrade the generalization capabilities of the solutions obtained.


2.3  Future  Directions 

In  the  work  described  above,  we  have  shown  how  gain  modification  can 
enhance the improvements in convergence rate obtained with high momentum
in synaptic modification for a simple concentric circle paradigm. For
completeness,  we  plan  to  do  simulations  in  which  gain  modification  is  used 
without  momentum.  In  addition,  further  analysis  and  reduction  of  the  data 
already  obtained  is  in  progress.  We  plan  to  establish  similar  benchmarks  for 
higher dimensional problems, for instance for higher dimensional concentric
hyperspheres. This will allow a comparison of how the improvements
obtained  with  gain  modification  scale  with  dimensionality.  Finally,  we  will 
establish benchmarks for gain modification for more complex problems such
as parity.

5 As before, error bars represent the standard deviation with respect to the neurons in
the layer.

6 E is the expected value, or mean.

7 Note that the scale of the graph in figure d.9 is different from the previous graphs.

Appendix a

Graphs of series ab: Convergence Properties as a Function of Momentum

Appendix b

Graphs of series ab: Magnitude of Synaptic Vectors vs. Convergence Time

Appendix c

Graphs of series ag: Convergence Properties for (1) No Momentum, (2) High Momentum, (3) High Momentum with Gain Modification

Appendix d

Graphs of series ag: Magnitude of Synaptic Vectors vs. Convergence Time